Featurizer tuneup (WIP) #488

Closed
wants to merge 22 commits into from

Conversation

ardunn
Contributor

@ardunn ardunn commented Jun 9, 2020

Making minor adjustments to existing featurizers:

Other

@ardunn ardunn mentioned this pull request Jul 8, 2020
10 tasks
@ardunn
Contributor Author

ardunn commented Jul 10, 2020

WIP comments for provenance/discussion on major changes

RDF/ReDF:

Both of these featurizers were relatively easy to convert to "flat-style" structure featurizers, mostly because the distance bins are the same for every structure, regardless of the structure or its number of sites. Instead of returning separate "distances" and "distribution" arrays, the distances now appear in the feature labels and are also stored internally as a user-facing attribute. The distribution value in each bin is the feature itself. These featurizers still require no fitting, yet now provide flat output.

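from matminer.datasets import load_dataset
from matminer.featurizers.conversions import StructureToOxidStructure
from matminer.featurizers.structure import ElectronicRadialDistributionFunction

df = load_dataset("matbench_jdft2d")
df = df.iloc[:10]
# ReDF works on oxidation-state-decorated structures, so decorate them first
soxi = StructureToOxidStructure(target_col_id="structure", overwrite_data=True)
df = soxi.featurize_dataframe(df, "structure")
# The cutoff is given explicitly; there is no auto-cutoff any more
erdf = ElectronicRadialDistributionFunction(cutoff=20)
df = erdf.featurize_dataframe(df, "structure", ignore_errors=True)
print(df)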

Old output:

                                           structure  electronic radial distribution function
0  [[1.49323139 3.32688406 7.26257785] Hf0+, [3.3...         {"distribution": [0.0, 0.0, ....], "distances": [0.05, 0.10...]}
1  [[1.85068084 4.37698238 6.9301577 ] As0+, [0. ...         {"distribution": [0.0, 0.0, ....], "distances": [0.05, 0.10...]}
2  [[ 0.          2.0213325  11.97279555] Ti3+, [...         {"distribution": [0.0, 0.0, ....], "distances": [0.05, 0.10...]}
3  [[2.39882726 2.39882726 2.53701553] In0+, [0.0...         {"distribution": [0.0, 0.0, ....], "distances": [0.05, 0.10...]}
...

New output (a few selected columns):

                                           structure  ReDF [0.00000 - 0.05000]A  ReDF [19.25000 - 19.30000]A  ReDF [20.00000 - 20.00000]A
0  [[1.49323139 3.32688406 7.26257785] Hf0+, [3.3...                        0.0                    -2.215358                          0.0
1  [[1.85068084 4.37698238 6.9301577 ] As0+, [0. ...                        0.0                     0.000000                          0.0
2  [[ 0.          2.0213325  11.97279555] Ti3+, [...                        0.0                     0.000000                          0.0
3  [[2.39882726 2.39882726 2.53701553] In0+, [0.0...                        0.0                     0.000000                          0.0
4  [[-1.83484554e-06  1.73300105e+00  2.61675943e...                        0.0                     0.000000                          0.0
5  [[ 3.70891373 -2.28956375  0.30684745] Sb5+, [...                        0.0                    -0.062230                          0.0
6  [[0.9242204  1.67995032 9.23318039] Mo6+, [2.7...                        0.0                     0.312708                          0.0
7  [[ 2.57809927e+00 -2.09669113e-05  3.45302620e...                        0.0                     0.124566                          0.0
8  [[0.         0.         3.74714516] Zr4+, [1.8...                        0.0                     0.000000                          0.0
9  [[1.96098228 0.         2.92357226] Pd4+, [1.9...                        0.0                    -0.498110                          0.0
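The flat layout shown above can be sketched in a few lines of plain Python. This is only an illustration of fixed bins and bin-derived labels, not the PR's implementation: the helper names (`bin_edges`, `flat_labels`, `flat_histogram`) are hypothetical, and the values are plain distance counts rather than the charge-weighted ReDF.

```python
import bisect


def bin_edges(cutoff, dr):
    """Bin edges from 0 to cutoff in steps of dr (the same for every structure)."""
    n_bins = int(round(cutoff / dr))
    return [i * dr for i in range(n_bins + 1)]


def flat_labels(edges, prefix="ReDF"):
    """One label per bin, mirroring the column names in the output above."""
    return [
        f"{prefix} [{lo:.5f} - {hi:.5f}]A"
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def flat_histogram(distances, edges):
    """Count distances in each half-open bin [lo, hi); one value per label.

    Assumption: simple counts stand in for the real distribution values.
    """
    counts = [0] * (len(edges) - 1)
    for d in distances:
        idx = bisect.bisect_right(edges, d) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts


edges = bin_edges(cutoff=2.0, dr=0.5)
row = dict(zip(flat_labels(edges), flat_histogram([0.2, 0.7, 0.9, 1.6, 2.5], edges)))
print(row)
```

Because the edges depend only on the cutoff and bin size, every structure yields a row with the same keys, which is what makes fitting unnecessary.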

Another major difference is the removal of the auto-cutoff determination in ReDF. Previously the cutoff was determined individually for each structure from the maximum diagonal of its unit cell. Because the cutoff would then differ between structures, featurizing multiple samples would require fitting, and all but the largest-diagonal unit cell would end up with at least one invalid distance bin. Keeping the auto-cutoff would essentially require non-flat output. The simplest and most maintainable solution was to remove it; it is reasonable to require users to specify a cutoff before featurizing. They can always remove excess features afterwards, based on their own definition of the cutoff.
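Trimming excess features after featurizing could look roughly like the sketch below. It assumes the column labels follow the `ReDF [lo - hi]A` pattern seen in the output above; `labels_within_cutoff` is a hypothetical helper, not part of matminer.

```python
import re

# Matches bin labels such as "ReDF [19.25000 - 19.30000]A" (pattern assumed
# from the example output; adjust if the label format differs).
_BIN_LABEL = re.compile(r"\[(?P<lo>[\d.]+) - (?P<hi>[\d.]+)\]A$")


def labels_within_cutoff(labels, cutoff):
    """Keep non-bin labels, plus bin labels whose upper edge is <= cutoff."""
    kept = []
    for label in labels:
        match = _BIN_LABEL.search(label)
        if match is None or float(match.group("hi")) <= cutoff:
            kept.append(label)
    return kept


labels = [
    "structure",
    "ReDF [0.00000 - 0.05000]A",
    "ReDF [19.25000 - 19.30000]A",
]
print(labels_within_cutoff(labels, cutoff=10.0))
```

With a pandas DataFrame, the same helper could be applied as `df[labels_within_cutoff(df.columns, cutoff)]`.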


MinimumRelativeDistances

The old MRD returned unequal-length vectors for sets of structures with different n_sites. This behavior is fully retained, so all code using MRD as it was should still work exactly as before.

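from matminer.datasets import load_dataset
from matminer.featurizers.structure import MinimumRelativeDistances

df = load_dataset("matbench_dielectric").iloc[:10]
# flatten=False keeps the old per-structure, variable-length output
mrd_unequal = MinimumRelativeDistances(flatten=False)
df = mrd_unequal.featurize_dataframe(df, "structure")
print(df)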

Old (and new, optional) output:

                                           structure             minimum relative distance of each site
0  [[4.29304147 2.4785886  1.07248561] S, [4.2930...  [0.6308738900868119, 0.6308738900868119, 0.630...
1  [[3.95051434 4.51121437 0.28035002] K, [4.3099...  [0.8513671695139504, 0.8973637356353322, 0.897...
2  [[-1.78688104  4.79604117  1.53044621] Rb, [-1...  [0.8812471959905366, 0.8812471959905365, 0.881...
3  [[4.51438064 4.51438064 0.        ] Mn, [0.133...  [0.9543245066092703, 0.9543245066092699, 0.954...
4  [[-4.36731958  6.8886097   0.50929706] Li, [-2...  [0.8440527034415538, 0.8441010470348463, 0.864...
5  [[0.04903784 0.0347292  0.08458426] Ca, [3.740...  [0.9315951904116712, 0.9315951681095045, 0.937...
6  [[5.60800905 1.10640796 1.26351442] Se, [3.762...  [0.8787088076156497, 0.878708788230288, 0.8787...
7  [[ 4.5571751   2.77317895 16.01717369] O, [0.5...  [0.782335013787583, 0.7823350137875834, 0.8672...
8  [[ 3.1542105   1.89817452 12.52003705] Se, [ 3...  [0.9042900048497616, 0.9042900048497614, 0.968...
9  [[-2.13978374 -5.56226053 -1.13086951] Li, [-0...  [0.9032164171063285, 0.9032164171063283, 0.903...

I've also added the option to return equal-length vectors (i.e., to flatten the output) by fitting on a dataset, so MRD is now optionally fittable. MRD can now also optionally return the site and neighbor species on which the minimum relative distances are based.

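from matminer.datasets import load_dataset
from matminer.featurizers.structure import MinimumRelativeDistances

df = load_dataset("matbench_dielectric").iloc[:10]
# flatten=True gives equal-length output, so the featurizer is fit first
mrd_flat = MinimumRelativeDistances(flatten=True)
df = mrd_flat.fit_featurize_dataframe(df, "structure")
print(df)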

New output: showing the first 6 feature columns

                                           structure  site #0 min. rel. dist. site #0 specie site #0 neighbor specie(s)  site #1 min. rel. dist. site #1 specie site #1 neighbor specie(s)
0  [[4.29304147 2.4785886  1.07248561] S, [4.2930...                 0.630874             S-                         S-                 0.630874             S-                         S-
1  [[3.95051434 4.51121437 0.28035002] K, [4.3099...                 0.851367             K+                         K+                 0.897364             K+                         K+
2  [[-1.78688104  4.79604117  1.53044621] Rb, [-1...                 0.881247            Rb+                        Rb+                 0.881247            Rb+                        Rb+
3  [[4.51438064 4.51438064 0.        ] Mn, [0.133...                 0.954325           Mn3+                       Mn3+                 0.954325           Mn3+                       Mn3+
4  [[-4.36731958  6.8886097   0.50929706] Li, [-2...                 0.844053            Li+                        Li+                 0.844101            Li+                        Li+
5  [[0.04903784 0.0347292  0.08458426] Ca, [3.740...                 0.931595           Ca2+                       Ca2+                 0.931595           Ca2+                       Ca2+
6  [[5.60800905 1.10640796 1.26351442] Se, [3.762...                 0.878709            Se-                        Se-                 0.878709           Se2-                       Se2-
7  [[ 4.5571751   2.77317895 16.01717369] O, [0.5...                 0.782335            O2-                        O2-                 0.782335            O2-                        O2-
8  [[ 3.1542105   1.89817452 12.52003705] Se, [ 3...                 0.904290           Se2-               (Se2-, Se2-)                 0.904290           Se2-                       Se2-
9  [[-2.13978374 -5.56226053 -1.13086951] Li, [-0...                 0.903216            Li+                        Li+                 0.903216            Li+                        Li+

I don't think these features are particularly useful for ML, but they are probably more useful for analysis in this form than in the previous form.
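Why flattening needs a fit step can be sketched as follows: the number of site columns has to be fixed from the dataset before any single structure can be featurized. `FlatPadder` is a hypothetical stand-in, and padding shorter vectors with NaN is an assumption, not necessarily what MRD does in this PR.

```python
import math


class FlatPadder:
    """Hypothetical helper: pads per-site vectors to a fitted common length."""

    def __init__(self):
        self.max_sites = None

    def fit(self, vectors):
        # The widest structure in the fitting set fixes the number of columns.
        self.max_sites = max(len(v) for v in vectors)
        return self

    def feature_labels(self):
        return [f"site #{i} min. rel. dist." for i in range(self.max_sites)]

    def featurize(self, vector):
        if self.max_sites is None:
            raise RuntimeError("call fit() before featurize()")
        if len(vector) > self.max_sites:
            raise ValueError("structure has more sites than seen during fit()")
        # Assumption: missing sites are filled with NaN.
        padding = [math.nan] * (self.max_sites - len(vector))
        return list(vector) + padding


per_structure = [[0.63, 0.63], [0.85, 0.90, 0.90]]
padder = FlatPadder().fit(per_structure)
rows = [padder.featurize(v) for v in per_structure]
print(padder.feature_labels())
print(rows)
```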


GlobalSymmetryFeatures

GlobalSymmetryFeatures now returns the number of symmetry operations along with the other global symmetry features. There didn't seem to be an existing way to convert the symmetry operations returned by SpacegroupAnalyzer (translation vector + rotation matrix) to strings representing each operation (e.g., , 2_1), so the one-hot encoding requested in #253 is not done. This could use code review from @utf or another crystal-symmetry expert on whether this is the best way to do things.
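As a partial illustration of the string conversion discussed above, the rotation part of an operation can be classified from its determinant and trace, which is a standard crystallographic result. This sketch is not part of the PR and covers only the point-group part: telling screw axes (such as 2_1) or glide planes apart would also require analyzing the translation vector, which is not attempted here. `rotation_symbol` is a hypothetical name, and the input is assumed to be a 3x3 rotation matrix in fractional coordinates.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Trace of a proper rotation (det = +1) -> rotation order.
_PROPER = {3: "1", -1: "2", 0: "3", 1: "4", 2: "6"}


def rotation_symbol(rotation):
    """Return the point-group symbol of the rotation part only.

    Assumes integer-valued entries (fractional basis); values are rounded
    to absorb floating-point noise.
    """
    det = round(det3(rotation))
    trace = round(rotation[0][0] + rotation[1][1] + rotation[2][2])
    try:
        if det == 1:
            return _PROPER[trace]
        if det == -1:
            # An improper operation is inversion times a proper rotation.
            symbol = _PROPER[-trace]
            return "m" if symbol == "2" else "-" + symbol
    except KeyError:
        pass
    raise ValueError("not a crystallographic rotation matrix")


four_fold_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
mirror_z = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
print(rotation_symbol(four_fold_z), rotation_symbol(mirror_z))
```

Strings of this kind could then be counted or one-hot encoded, but whether that matches the request in #253 is left open.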


SOAP

Implements the formation energy preset (from the DScribe paper) inside the SOAP class for site featurization. Uses SiteStatsFeaturizer for averaged structure featurization.
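The averaging step amounts to taking a per-feature mean over all sites in a structure. The sketch below is a pure-Python stand-in for that idea, not SiteStatsFeaturizer itself; `average_site_features` is a hypothetical name and the numbers are made up for illustration.

```python
from statistics import fmean


def average_site_features(site_vectors):
    """Column-wise mean over sites: one structure-level vector per structure."""
    if not site_vectors:
        raise ValueError("need at least one site vector")
    n_features = len(site_vectors[0])
    if any(len(v) != n_features for v in site_vectors):
        raise ValueError("all site vectors must have the same length")
    # zip(*rows) transposes sites x features into features x sites.
    return [fmean(column) for column in zip(*site_vectors)]


# Two sites, two features each (illustrative values only).
print(average_site_features([[1.0, 2.0], [3.0, 6.0]]))
```

Averaging gives a fixed-length structure vector regardless of the number of sites, at the cost of discarding per-site detail.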


Base automatically changed from master to main March 10, 2021 18:01
ardunn added a commit to ardunn/matminer that referenced this pull request Jun 7, 2021
ardunn added a commit to ardunn/matminer that referenced this pull request Jun 7, 2021
@ardunn
Contributor Author

ardunn commented Jun 7, 2021

superseded by #634

@ardunn ardunn closed this Jun 7, 2021